Today we will…
Recall from your statistics classes…
A random variable is a value we don’t know until we take a sample.
The distribution of a random variable tells us its possible values and how likely they are.
Uniform Distribution
Normal Distribution
t-Distribution
Chi-Square Distribution
Binomial Distribution
r is for random sampling.
d is for density.
p is for probability (the cumulative distribution function).
q is for quantile; the q functions are the "backwards" (inverse) of the p functions.
Probability of exactly 12 heads in 20 coin tosses, with a 70% chance of tails?
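This can be computed with `dbinom()`. Note the trick in the wording: a 70% chance of tails means P(heads) = 0.3.

```r
# P(exactly 12 heads in 20 tosses), where P(heads) = 1 - 0.7 = 0.3
dbinom(12, size = 20, prob = 0.3)
# ≈ 0.00386
```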
Empirical: the observed data.
We can generate fake data based on the assumption that a variable follows a certain distribution.
Since there is randomness involved, we will get a different result each time we run the code.
To make a reproducible random sample, we first set the seed:
set.seed(435)
fake_data <- tibble(names = charlatan::ch_name(1000),
height = rnorm(1000, mean = 67, sd = 3),
age = runif(1000, min = 15, max = 75),
measure = rbinom(1000, size = 1, prob = 0.6)) |>
mutate(supports_measure_A = ifelse(measure == 1, "yes", "no"))
head(fake_data)
# A tibble: 6 × 5
names height age measure supports_measure_A
<chr> <dbl> <dbl> <int> <chr>
1 Elbridge Kautzer 67.4 66.3 1 yes
2 Brandon King 65.0 61.5 0 no
3 Phyllis Thompson 68.1 53.8 1 yes
4 Humberto Corwin 67.5 33.9 1 yes
5 Theresia Koelpin 71.4 16.1 1 yes
6 Hayden O'Reilly-Johns 66.2 37.0 0 no
Check that the ages look uniformly distributed.
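One way to check, assuming the `fake_data` tibble created above, is a histogram of `age`, which should look roughly flat:

```r
library(ggplot2)

ggplot(fake_data, aes(x = age)) +
  geom_histogram(bins = 20)
```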
map()
Write a function to simulate height data for a population with a given average height and SD.
Simulate multiple datasets with different average heights and SDs of height.
# A tibble: 6 × 4
mean_ht std_ht person ht
<dbl> <dbl> <int> <dbl>
1 60 3 1 57.2
2 60 3 2 56.0
3 60 3 3 55.3
4 60 3 4 59.7
5 60 3 5 61.5
6 60 3 6 58.7
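A sketch of that workflow (the helper name `sim_heights` is hypothetical):

```r
library(dplyr)
library(purrr)
library(tidyr)

# Simulate n heights for a population with a given mean and SD
sim_heights <- function(mean_ht, std_ht, n = 100) {
  tibble(
    mean_ht = mean_ht,
    std_ht  = std_ht,
    person  = 1:n,
    ht      = rnorm(n, mean = mean_ht, sd = std_ht)
  )
}

# One row of parameters per simulated dataset
params <- expand_grid(mean_ht = c(60, 65, 70), std_ht = c(3, 5))

# Map the function over the parameter combinations and stack the results
sims <- pmap(params, sim_heights) |> list_rbind()
head(sims)
```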
sample()
Suppose you want to take a random sample of values:
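For example, with base R's `sample()`:

```r
# 5 values from 1:10, without replacement (the default)
sample(1:10, size = 5)

# 10 coin flips: sampling with replacement
sample(c("heads", "tails"), size = 10, replace = TRUE)
```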
sample_n()
Suppose you want to take a random sample of observations (rows) from your data:
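For example, assuming the `fake_data` tibble from earlier (note that `sample_n()` has been superseded by `slice_sample()` in recent versions of dplyr):

```r
library(dplyr)

fake_data |> sample_n(5)          # 5 random rows
fake_data |> slice_sample(n = 5)  # modern equivalent
```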
Suppose there is a group of 50 people. Simulate the approximate probability at least two people have the same birthday (i.e. the same day of the year not considering year of birth; also ignore the leap year).
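A sketch of one approach (the helper name `one_trial` is hypothetical):

```r
set.seed(435)

# TRUE if at least two of n people share a birthday (365 equally likely days)
one_trial <- function(n = 50) {
  bdays <- sample(1:365, size = n, replace = TRUE)
  any(duplicated(bdays))
}

# Approximate the probability with many simulated groups
mean(replicate(10000, one_trial()))
# ≈ 0.97
```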
We can automatically include code output in the written portion of a Quarto document using `r `.
Type this: My random number is `r my_rand`.
To get this: My random number is 0.5819984.
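For this to work, `my_rand` must be defined in a code chunk earlier in the document, for example:

```r
my_rand <- runif(1)
```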
Today we will…
This dataset contains a random sample of 1,000 births from North Carolina in 2004 (sampled from a larger dataset).
library(openintro)
data(ncbirths)
slice_sample(ncbirths, n = 10) |>
knitr::kable() |>
kableExtra::scroll_box(height = "400px") |>
kableExtra::kable_styling(font_size = 30)
| fage | mage | mature | weeks | premie | visits | marital | gained | weight | lowbirthweight | gender | habit | whitemom |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 22 | younger mom | 39 | full term | 5 | not married | 10 | 7.19 | not low | female | nonsmoker | white |
| NA | 20 | younger mom | 39 | full term | 16 | not married | 19 | 5.44 | low | female | nonsmoker | white |
| 34 | 31 | younger mom | 39 | full term | 15 | married | 22 | 8.69 | not low | male | nonsmoker | white |
| 37 | 37 | mature mom | 41 | full term | 17 | married | 45 | 9.88 | not low | male | nonsmoker | white |
| 29 | 29 | younger mom | 44 | full term | 12 | married | 51 | 8.50 | not low | male | nonsmoker | white |
| 22 | 22 | younger mom | 40 | full term | 15 | not married | 31 | 7.06 | not low | female | nonsmoker | white |
| 24 | 22 | younger mom | 39 | full term | 11 | married | 27 | 6.00 | not low | male | smoker | white |
| 24 | 22 | younger mom | 39 | full term | 15 | not married | 45 | 7.50 | not low | male | nonsmoker | not white |
| 35 | 25 | younger mom | 37 | full term | 30 | married | 60 | 3.94 | low | female | nonsmoker | white |
| 30 | 34 | younger mom | 40 | full term | 10 | married | 22 | 9.31 | not low | male | nonsmoker | white |
In statistical models, we generally have one variable that is the response and one or more variables that are explanatory.
The scatterplot has been called the most “generally useful invention in the history of statistical graphics.”
Characterizing Relationships
How would you characterize this relationship?
Note: Much of what we are doing at this stage involves making judgment calls!
We often assume the value of our response variable is some function of our explanatory variable, plus some random noise.
\[response = f(explanatory) + noise\]
If we assume the relationship between \(x\) and \(y\) takes the form of a linear function…
\[ response = intercept + slope \times explanatory + noise \]
We use the following notation for this model:
Population Regression Model
\(Y = \beta_0 + \beta_1 X + \varepsilon\)
where \(\varepsilon \sim N(0, \sigma)\) is the random noise.
Fitted Regression Model
\(\hat{y}= \hat{\beta}_0 + \hat{\beta}_1 x\)
where \(\hat{}\) indicates the value was estimated.
Regress baby birthweight (response variable) on the pregnant parent’s weight gain (explanatory variable).
When visualizing data, fit a regression line (\(y\) on \(x\)) to your scatterplot.
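With ggplot2, `geom_smooth(method = "lm")` adds the fitted line to a scatterplot:

```r
library(ggplot2)
library(openintro)

data(ncbirths)

ggplot(ncbirths, aes(x = gained, y = weight)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
```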
The lm() function fits a linear regression model.
Use broom::tidy() to get a tidy summary of the coefficients from the lm object:
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 6.63 0.111 59.6 0
2 gained 0.0161 0.00332 4.86 0.00000135
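A coefficient table like the one above can be produced by fitting the model with `lm()` and tidying it with broom (`wt_mod` is just an illustrative name):

```r
library(openintro)
library(broom)

data(ncbirths)

wt_mod <- lm(weight ~ gained, data = ncbirths)
tidy(wt_mod)
```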
A residual is the difference between the observed value (point) and the fitted value (line).
# A tibble: 3 × 9
.rownames weight gained .fitted .resid .hat .sigma .cooksd .std.resid
<chr> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 7.63 38 7.24 0.387 0.00133 1.47 0.0000458 0.262
2 2 7.88 20 6.95 0.927 0.00157 1.47 0.000311 0.630
3 3 6.63 38 7.24 -0.613 0.00133 1.47 0.000115 -0.416
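A residual table like the one above can be produced with `broom::augment()`, which returns the fitted value and residual for each observation:

```r
library(openintro)
library(broom)

data(ncbirths)

wt_mod <- lm(weight ~ gained, data = ncbirths)
augment(wt_mod) |> head(3)
```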
There are four conditions that must be met for a linear regression model to be appropriate:
Is the relationship linear?
Are the observations independent? Difficult to tell!
What does independence mean?
Should not be able to know the \(y\) value for one observation based on the \(y\) value for another observation.
Independence violations:
Are the residuals normally distributed?
Less important than linearity or independence:
Do the residuals have equal (constant) variance?
Sum of Square Errors (SSE)
Root Mean Square Error (RMSE)
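In symbols, with fitted values \(\hat{y}_i\):

\[ SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad RMSE = \sqrt{\frac{SSE}{n}} \]

(R's summary() reports the closely related residual standard error, which divides by the degrees of freedom \(n - 2\) rather than \(n\).)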
Call:
lm(formula = weight ~ gained, data = ncbirths)
Residuals:
Min 1Q Median 3Q Max
-6.4085 -0.6950 0.1643 0.9222 4.5158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.63003 0.11120 59.620 < 2e-16 ***
gained 0.01614 0.00332 4.862 1.35e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.474 on 971 degrees of freedom
(27 observations deleted due to missingness)
Multiple R-squared: 0.02377, Adjusted R-squared: 0.02276
F-statistic: 23.64 on 1 and 971 DF, p-value: 1.353e-06
R-squared
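\(R^2\) is the proportion of variance in the response explained by the model:

\[ R^2 = 1 - \frac{SSE}{SST}, \qquad SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 \]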
Regress baby birthweight on…
… gestation weeks.
Why does it make sense that the left model is better?
When fitting a linear regression, you can include…
…multiple explanatory variables.
lm(y ~ x1 + x2 + x3 + ... + xn)
…interaction terms.
lm(y ~ x1 + x2 + x1:x2)
lm(y ~ x1*x2)
…quadratic relationships.
lm(y ~ I(x1^2) + x1)
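For example, a model for birthweight with weight gained, gestation weeks, and their interaction:

```r
library(openintro)

data(ncbirths)

lm(weight ~ gained * weeks, data = ncbirths)
```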